25 May 2016

Overview

Motivation
Four general principles
Case study
Costs and benefits

Motivation

Universalism

'Communism'

Disinterestedness

Organized skepticism

Origins of scientific skepticism

Robert Boyle's vacuum pump

Documentation

'Communal witnessing'

Circumstances

Stodden's contemporary equivalents

Empirical Reproducibility

Computational Reproducibility

Statistical Reproducibility

Peng: 'For every X there is a Computational X'

Computational Biology

Computational Physics

Computational Chemistry

Computational Economics

Computational …

Computers are the new vacuum pump

Key ideas

Reproducibility is necessary for scientific progress
Computers wrangle all the data, but also obscure it
Especially point-and-click actions
Technical solutions are available: open source, open formats, open data, open access

Four general principles of reproducible research that have emerged in other fields

1. Make openly available the data and methods that generated the published results

✓ Plain text file formats

✓ Persistent URLs

Victoria Stodden's Reproducible Research Standard

✓ Data: CC-0 (public domain)

✓ Code: MIT (no liability for reuse)

✓ Text/Figures/Media: CC-BY (attribution required)

2. Write scripts to conduct analyses

✗ Mouse gestures leave few traces that are enduring and accessible to others

✗ Easy to lose track of ad hoc changes in mouse-driven environments

✓ Everything should be scripted: data ingest, cleaning, analysis, visualizing, and reporting

✓ Scripts create a very high-resolution record of the research workflow in a plain text file that can be reused and inspected by others
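As a rough illustration of the second principle, the sketch below runs ingest, cleaning, analysis and reporting as plain functions in a single script. It is written in Python rather than the R used in the case study, and the column names, the inline sample data and the function names are all hypothetical, chosen only to keep the example self-contained.

```python
import csv
import io
import statistics

# Hypothetical raw data; in a real project this would be a CSV file on disk.
RAW_CSV = """site,length_mm
A,12.5
A,
B,14.0
B,15.5
"""


def ingest(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing measurement and convert the rest to floats."""
    return [
        {"site": r["site"], "length_mm": float(r["length_mm"])}
        for r in rows
        if r["length_mm"].strip()
    ]


def analyse(rows):
    """Compute the mean length per site."""
    by_site = {}
    for r in rows:
        by_site.setdefault(r["site"], []).append(r["length_mm"])
    return {site: statistics.mean(vals) for site, vals in by_site.items()}


def report(summary):
    """Format the per-site means as CSV text, sorted by site."""
    lines = ["site,mean_length_mm"]
    for site in sorted(summary):
        lines.append(f"{site},{summary[site]:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Every step from raw data to reported numbers is recorded in this file.
    print(report(analyse(clean(ingest(RAW_CSV)))))
```

Because each decision (such as dropping the row with the missing measurement) lives in the script, a reader can inspect it and rerun the whole analysis, which a sequence of mouse clicks would not allow.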

3. Use version control to track changes

✗ Managing different versions of computer files is very challenging

✗ Poor version control leads to losing track of the provenance of results

✓ Version control systems (VCS) designed for software engineering are also suitable for research code and text

✓ Commit history preserves a high-resolution, transparent record of the development of a file or set of files

✓ Enables remote collaborators to work together without overwriting each other’s work

4. Describe and archive the computational environment

✗ Minor changes in software can cripple complex research pipelines

✗ Managing software dependencies is tedious

✓ List the key pieces of software and their version numbers

✓ Archive a self-contained computational environment like a virtual machine or Linux container
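The lighter-weight half of the fourth principle, listing key software and version numbers, could be scripted along the following lines. This is a minimal sketch in Python, not the method used in the case study (which relied on R and Docker). The function name `describe_environment` and the package names passed to it are hypothetical.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return lines recording the interpreter, the OS and package versions."""
    lines = [
        f"python: {platform.python_version()}",
        f"platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than silently skipping it.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


if __name__ == "__main__":
    # Hypothetical list of the packages an analysis depends on.
    for line in describe_environment(["numpy", "pandas"]):
        print(line)
```

Saving this output next to the results gives readers a record of what was installed. It does not recreate the environment, which is the job of an archived virtual machine or container.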

Case Study

First principle

All files on figshare.com

Data in CSV format

Organised as an R package

Second principle

R & Rmarkdown documents

Third principle

All files tracked with Git, hosted on GitHub

Collaboration did not occur on GitHub because no co-authors used it

Fourth principle

Docker image and Dockerfile to contain RStudio, packages, code and external dependencies

Based on Rocker image and templates

Smaller than a VM

Extreme isolation

Continuous integration is helpful

.travis.yml
circle.yml

Research compendium +

README.md
R package & manuscript
VCS repository
code CI
environment CI

Costs & benefits

Costs

Time learning the tools

A lot of time

Built-in vs Bolt-on

Benefits

Comfort of knowing that I am right & have no secrets

Save time by reusing my previous code

Open data confers citation advantages, but magnitude is highly variable

Open Source community membership provides access to high-quality help

Two implications: Training

Two implications: Incentives

Summary

  • Open methods and materials, scripted workflow, version control and environment control are generic principles suitable for most fields of research
  • The specific details will change over time, but the principles will endure
  • For most people, the technical problems already have good solutions; the remaining challenge is cultural

Colophon